Bollywood_India Analysis
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
DATA_FOLDER = '../../final/'
def var_loader(DATA_FOLDER, mode='hollywood'):
    results = []
    results.append(pd.read_csv(DATA_FOLDER + f"{mode}/"+ f"{mode}_data.csv"))
    results.append(pd.read_csv(DATA_FOLDER + f"{mode}/"+ f"{mode}_data_ethnicity.csv"))
    results.append(pd.read_csv(DATA_FOLDER + f"{mode}/"+ f"{mode}_ethnic_realworld.csv"))
    results.append(pd.read_csv(DATA_FOLDER + f"{mode}/"+ f"male_{mode}_realworld_averages.csv"))
    results.append(pd.read_csv(DATA_FOLDER + f"{mode}/"+ f"female_{mode}_realworld_averages.csv"))
    results.append(pd.read_csv(DATA_FOLDER + f"{mode}/"+ f"bothsexes_{mode}_realworld_averages.csv"))
    results.append(pd.read_csv(DATA_FOLDER + f"{mode}/"+ f"male_{mode}_realworld_proportions.csv"))
    results.append(pd.read_csv(DATA_FOLDER + f"{mode}/"+ f"female_{mode}_realworld_proportions.csv"))
    return results
# Load the Bollywood datasets into their respective dataframes
bollywood_data, bollywood_data_ethnicity, bollywood_ethnic_realworld, \
male_bollywood_realworld_averages, female_bollywood_realworld_averages, \
bothsexes_bollywood_realworld_averages, male_bollywood_realworld_proportions, \
female_bollywood_realworld_proportions = var_loader(DATA_FOLDER, mode="bollywood")
Ethnicities
Observations
South Indian Ethnicities (Overrepresented in Movies):
- South Indian ethnicities are significantly overrepresented in movies compared to their real-world population proportion.
- This may suggest a strong influence of the South Indian film industry (like Tamil, Telugu, and Malayalam cinema) on the dataset, as these industries are major contributors to Indian cinema.
Eastern Indian Ethnicities (Underrepresented):
- Eastern Indian ethnicities are underrepresented in the movies.
- This aligns with the observation that cinema from Eastern India (like Bengali cinema) is smaller in scale and has less influence nationally compared to South Indian or Bollywood films.
Western and Central Indian Ethnicities (Closer Representation):
- Western and Central Indian ethnicities appear to have a closer proportion between real-world data and movies.
- Bollywood, which is centered in Mumbai (Maharashtra), likely plays a role here since it represents Western and
Religious and Caste Groups (Underrepresented):
- Groups tied to religious and caste identities appear underrepresented in movies.
- This could be due to the general trend of movies avoiding direct references to caste or religion to maintain neutrality or appeal to a wider audience.
- The underrepresentation of caste-based groups could indicate systemic biases in casting and storytelling, where mainstream cinema does not reflect India's real diversity.
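The over- and underrepresentation described above can be summarised as a ratio of on-screen share to real-world share. The following is a minimal sketch under stated assumptions: the group names and the share values are invented for illustration and are not taken from the project files, whose column layout is not shown here.

```python
import pandas as pd

# Hypothetical shares (fractions summing to 1); NOT taken from the project data.
movie_share = pd.Series(
    {"South": 0.40, "North": 0.30, "East": 0.10, "West_Central": 0.20}
)
real_world_share = pd.Series(
    {"South": 0.20, "North": 0.35, "East": 0.25, "West_Central": 0.20}
)

# Ratio > 1: overrepresented on screen; ratio < 1: underrepresented.
representation_ratio = (movie_share / real_world_share).sort_values(ascending=False)
print(representation_ratio.round(2))
```

A ratio close to 1 corresponds to the "closer representation" case noted for Western and Central Indian ethnicities.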
The following analysis checks whether movies with a given majority ethnicity earn more at the box office, which could provide insight into the market dynamics of the Indian film industry:
# Step 1: Calculate the majority ethnicity for each movie
# Group the data by movie (using wiki_movie_id) and determine the majority ethnicity
def get_majority_ethnicity(series):
    # Exclude "Unknown" from the series
    filtered_series = series[series != 'Unknown']
    # Check if the filtered series is not empty
    if not filtered_series.empty:
        return filtered_series.value_counts().idxmax()  # Get the most frequent ethnicity
    else:
        return 'Unknown'  # Return "Unknown" if all entries are "Unknown"
bollywood_data['actor_ethnicity_classification'] = bollywood_data['actor_ethnicity_classification'].fillna('Unknown')
# Group by movie and assign the majority ethnicity to each movie
bollywood_data['majority_ethnicity'] = bollywood_data.groupby('wiki_movie_id')['actor_ethnicity_classification'].transform(get_majority_ethnicity)
filtered_bollywood_data = bollywood_data[bollywood_data['majority_ethnicity'] != 'Unknown']
# Reset the index (assignment form instead of modifying the filtered frame in place)
filtered_bollywood_data = filtered_bollywood_data.reset_index(drop=True)
# Display the filtered data
print(filtered_bollywood_data['majority_ethnicity'].value_counts())
majority_ethnicity
South_Indian_Ethnicities                  13629
North_Indian_Ethnicities                  10984
Western_and_Central_Indian_Ethnicities     5766
Eastern_Indian_Ethnicities                 3495
Religious_and_Caste_Groups                  953
Name: count, dtype: int64
filtered_bollywood_data.head()
| | wiki_movie_id | movie_name | box_office | release_y | actor_gender | actor_name | age_at_release | actor_birth_year | countries_label | genres_label | actor_ethnicity_label | main_genre | actor_ethnicity_classification | majority_ethnicity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 232301.0 | Swathi Muthyam | 17768757.0 | 1985.0 | M | Kamal Haasan | 30.0 | 1954.0 | ['India'] | ['Family Film', 'Drama'] | Tamil Brahmin | Animation/Family | South_Indian_Ethnicities | South_Indian_Ethnicities |
| 1 | 232301.0 | Swathi Muthyam | 17768757.0 | 1985.0 | F | Raadhika Sarathkumar | 22.0 | 1962.0 | ['India'] | ['Family Film', 'Drama'] | Sri Lankan Tamils | Animation/Family | South_Indian_Ethnicities | South_Indian_Ethnicities |
| 2 | 232301.0 | Swathi Muthyam | 17768757.0 | 1985.0 | F | Dubbing Janaki | NaN | NaN | ['India'] | ['Family Film', 'Drama'] | Unknown | Animation/Family | Unknown | South_Indian_Ethnicities |
| 3 | 232325.0 | Swathi Kiranam | 17768757.0 | 1992.0 | F | Raadhika Sarathkumar | 29.0 | 1962.0 | ['India'] | ['Family Film', 'Drama', 'Musical'] | Sri Lankan Tamils | Animation/Family | South_Indian_Ethnicities | South_Indian_Ethnicities |
| 4 | 232325.0 | Swathi Kiranam | 17768757.0 | 1992.0 | M | Mammootty | 40.0 | 1951.0 | ['India'] | ['Family Film', 'Drama', 'Musical'] | Malayali | Animation/Family | South_Indian_Ethnicities | South_Indian_Ethnicities |
# Step 2: Extract unique rows for movies (since majority ethnicity will be the same for each actor of a movie)
movies_majority_ethnicity = filtered_bollywood_data[['wiki_movie_id', 'majority_ethnicity', 'box_office']].drop_duplicates()
# Step 3: Calculate the average box office revenue for each majority ethnicity
average_revenue_by_ethnicity = movies_majority_ethnicity.groupby('majority_ethnicity')['box_office'].mean().reset_index()
average_revenue_by_ethnicity.columns = ['Ethnicity', 'Average Box Office Revenue']
# Sort the dataframe by 'Average Box Office Revenue' in descending order
average_revenue_by_ethnicity = average_revenue_by_ethnicity.sort_values(by='Average Box Office Revenue', ascending=False).reset_index(drop=True)
average_revenue_by_ethnicity.head()
| | Ethnicity | Average Box Office Revenue |
|---|---|---|
| 0 | Eastern_Indian_Ethnicities | 1.776876e+07 |
| 1 | Western_and_Central_Indian_Ethnicities | 1.773630e+07 |
| 2 | North_Indian_Ethnicities | 1.772318e+07 |
| 3 | South_Indian_Ethnicities | 1.766447e+07 |
| 4 | Religious_and_Caste_Groups | 1.763779e+07 |
# Step 4: Plot the results
# Create the horizontal bar chart with a zoomed-in x-axis
fig = px.bar(
    average_revenue_by_ethnicity,
    x='Average Box Office Revenue',
    y='Ethnicity',
    orientation='h',  # Horizontal orientation
    title="Average Box Office Revenue by Majority Ethnicity of Actors",
    labels={'Average Box Office Revenue': 'Average Box Office Revenue ($)', 'Ethnicity': 'Majority Ethnicity'},
    text='Average Box Office Revenue'  # Add text labels
)
# Adjust the layout for a zoomed-in view
# '.4s' keeps four significant digits; '.2s' would round every label to the same value
fig.update_traces(texttemplate='%{text:.4s}', textposition='outside')
fig.update_layout(
    xaxis=dict(range=[1.76e7, 1.78e7]),  # Narrow range: bars do not start at zero, so their lengths exaggerate the differences
    showlegend=False
)
fig.show()
# Save the figure as an HTML file
output_file = "average_Box_Office_revenue_by_ethnicity.html"
fig.write_html(output_file)
print(f"Figure saved as {output_file}")
Figure saved as average_Box_Office_revenue_by_ethnicity.html
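We expected movies with a South Indian majority to have the highest average box office revenue, but the data indicate otherwise: Eastern Indian majority movies rank first and South Indian majority movies fourth. The gap between the highest and lowest averages is under 1% (about 1.777e7 versus 1.764e7), so the ranking should not be over-interpreted.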
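All five sample rows displayed earlier carry the same `box_office` value (17768757.0), and the group averages all sit close to that number. One possible explanation, which the notebook does not confirm, is that a single repeated value dominates the revenue column. The sketch below shows how that could be checked. It uses a toy table, and `modal_share` and `non_modal_mean` are illustrative names, not part of the original analysis.

```python
import pandas as pd

# Toy movie-level table; in the notebook this role would be played by
# movies_majority_ethnicity (one row per movie).
movies = pd.DataFrame({
    "wiki_movie_id": [1, 2, 3, 4, 5],
    "box_office": [17768757.0, 17768757.0, 17768757.0, 2.5e6, 9.0e7],
})

# Most frequent revenue value and the share of movies carrying it.
modal_value = movies["box_office"].mode().iloc[0]
modal_share = (movies["box_office"] == modal_value).mean()
print(f"Modal box office: {modal_value:,.0f} ({modal_share:.0%} of movies)")

# Average restricted to movies whose revenue differs from the modal value.
non_modal_mean = movies.loc[movies["box_office"] != modal_value, "box_office"].mean()
print(f"Mean excluding the modal value: {non_modal_mean:,.0f}")
```

If the modal share turned out to be large on the real data, comparing group averages only over the remaining movies would be the more informative test.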
Gender
Observations
Across most genres, Bollywood exhibits a clear male dominance in representation:
- Action/Adventure and Thriller/Suspense show the highest male representation (69.1% and 68.8% male, respectively).
Cultural and Social Reflections: The male-dominated representation might reflect broader societal attitudes in India during the time frame of the data, emphasizing traditional gender roles. The slightly improved representation in Fantasy and Sci-Fi might align with modern, progressive storytelling in these genres.
# Reload the Bollywood datasets for fresh un-altered data
bollywood_data, bollywood_data_ethnicity, bollywood_ethnic_realworld, \
male_bollywood_realworld_averages, female_bollywood_realworld_averages, \
bothsexes_bollywood_realworld_averages, male_bollywood_realworld_proportions, \
female_bollywood_realworld_proportions = var_loader(DATA_FOLDER, mode="bollywood")
# Step 1: Group actors by movie to calculate gender proportions, ignoring NaNs
def calculate_gender_proportions(df):
    # Drop rows with NaN in actor_gender
    df = df.dropna(subset=['actor_gender'])
    # Calculate gender proportions
    gender_counts = df.groupby('wiki_movie_id')['actor_gender'].value_counts(normalize=True).unstack(fill_value=0)
    return gender_counts
# Calculate gender proportions
gender_proportions = calculate_gender_proportions(bollywood_data)
gender_proportions.head()
| actor_gender | F | M |
|---|---|---|
| wiki_movie_id | ||
| 30674.0 | 0.363636 | 0.636364 |
| 222328.0 | 0.500000 | 0.500000 |
| 232291.0 | 0.571429 | 0.428571 |
| 232301.0 | 0.666667 | 0.333333 |
| 232325.0 | 0.333333 | 0.666667 |
# Step 2: Drop movies with all NaN genders (those that do not appear in gender_proportions)
bollywood_data = bollywood_data[bollywood_data['wiki_movie_id'].isin(gender_proportions.index)]
# Step 3: Label movies based on gender proportions
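def label_gender_dominance(row):
    # Note: the female check runs first with >=, so an exact 50/50 cast is labelled
    # 'Female-Dominated' and the 'Balanced' branch is never reached when F + M == 1.
    if row.get('F', 0) >= 0.5:
        return 'Female-Dominated'
    elif row.get('M', 0) >= 0.5:
        return 'Male-Dominated'
    else:
        return 'Balanced'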
gender_proportions['Gender Dominance'] = gender_proportions.apply(label_gender_dominance, axis=1)
display(gender_proportions.head())
# Merge gender dominance labels back to the main data
bollywood_data = bollywood_data.merge(
    gender_proportions[['Gender Dominance']],
    left_on='wiki_movie_id',
    right_index=True,
    how='left'
)
bollywood_data.head(1)
| actor_gender | F | M | Gender Dominance |
|---|---|---|---|
| wiki_movie_id | |||
| 30674.0 | 0.363636 | 0.636364 | Male-Dominated |
| 222328.0 | 0.500000 | 0.500000 | Female-Dominated |
| 232291.0 | 0.571429 | 0.428571 | Female-Dominated |
| 232301.0 | 0.666667 | 0.333333 | Female-Dominated |
| 232325.0 | 0.333333 | 0.666667 | Male-Dominated |
| | wiki_movie_id | movie_name | box_office | release_y | actor_gender | actor_name | age_at_release | actor_birth_year | countries_label | genres_label | actor_ethnicity_label | main_genre | actor_ethnicity_classification | Gender Dominance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30674.0 | The Terrorist | 17768757.0 | 1999.0 | F | Bhavani | NaN | NaN | ['India'] | ['Thriller', 'World cinema', 'Drama', 'Politic... | Unknown | Drama | NaN | Male-Dominated |
# Count the number of female-dominated and male-dominated movies
dominance_counts = gender_proportions['Gender Dominance'].value_counts()
# Display the counts
print(dominance_counts)
Gender Dominance
Male-Dominated      5228
Female-Dominated    2249
Name: count, dtype: int64
# Data preparation
dominance_counts = gender_proportions['Gender Dominance'].value_counts().reset_index()
dominance_counts.columns = ['Gender Dominance', 'Number of Movies']
# Plotting with Plotly
fig = px.bar(
    dominance_counts,
    x='Gender Dominance',
    y='Number of Movies',
    color='Gender Dominance',
    text='Number of Movies',
    color_discrete_map={'Male-Dominated': 'steelblue', 'Female-Dominated': 'lightcoral'},
    title='Number of Male-Dominated vs Female-Dominated Movies'
)
# Customizing layout
fig.update_traces(textposition='outside', marker_line_color='black', marker_line_width=1.5)
fig.update_layout(
    title_font_size=18,
    xaxis_title='Gender Dominance',
    yaxis_title='Number of Movies',
    xaxis=dict(tickangle=0),
    template='plotly_white'
)
# Show the plot
fig.show()
# Save the figure as an HTML file
output_file = "Male-Dominated_vs_Female-Dominated_Movies.html"
fig.write_html(output_file)
print(f"Figure saved as {output_file}")
Figure saved as Male-Dominated_vs_Female-Dominated_Movies.html
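Female Representation: Female-dominated movies exist (2,249 versus 5,228 male-dominated), but their lower number reflects a gender disparity, consistent with industry trends in which male-centric narratives dominate cinema. Because the labelling rule counts movies with an exactly even cast as female-dominated, the real gap is, if anything, somewhat larger. This should open discussions about improving gender equity in casting and storytelling.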
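The counts contain no 'Balanced' category because `label_gender_dominance` tests the female share first with `>=`, so a 50/50 cast (such as movie 222328.0 in the table above) is labelled 'Female-Dominated'. A possible alternative, sketched here as an assumption and not as the notebook's method, uses strict majorities so that ties become 'Balanced'. The function name `label_gender_dominance_strict` is hypothetical.

```python
import pandas as pd

def label_gender_dominance_strict(row):
    """Label a movie using strict majorities so that exact ties are 'Balanced'."""
    if row.get("F", 0) > 0.5:
        return "Female-Dominated"
    if row.get("M", 0) > 0.5:
        return "Male-Dominated"
    return "Balanced"

# Three rows copied from the gender_proportions output shown earlier.
toy = pd.DataFrame(
    {"F": [0.363636, 0.5, 0.571429], "M": [0.636364, 0.5, 0.428571]},
    index=pd.Index([30674.0, 222328.0, 232291.0], name="wiki_movie_id"),
)
toy["Gender Dominance"] = toy.apply(label_gender_dominance_strict, axis=1)
print(toy)
```

Applying this variant to the full data would change the counts reported above, since every evenly split movie would move out of the female-dominated group.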
# Step 4: Calculate average box office revenue for each category
average_revenue_by_gender = bollywood_data.groupby('Gender Dominance')['box_office'].mean().reset_index()
average_revenue_by_gender.columns = ['Gender Dominance', 'Average Box Office Revenue']
average_revenue_by_gender = average_revenue_by_gender.sort_values(by= 'Average Box Office Revenue', ascending= False).reset_index(drop=True)
average_revenue_by_gender.head()
| | Gender Dominance | Average Box Office Revenue |
|---|---|---|
| 0 | Male-Dominated | 1.772160e+07 |
| 1 | Female-Dominated | 1.769631e+07 |
# Plot the average box office revenue by gender dominance
# Create the horizontal bar chart with a zoomed-in x-axis
fig = px.bar(
    average_revenue_by_gender,
    x='Average Box Office Revenue',
    y='Gender Dominance',
    orientation='h',  # Horizontal orientation
    title="Average Box Office Revenue by Dominating Gender in Movie",
    labels={'Average Box Office Revenue': 'Average Box Office Revenue ($)', 'Gender Dominance': 'Gender Dominance'},
    text='Average Box Office Revenue'  # Add text labels
)
# Adjust the layout for a zoomed-in view
# '.4s' keeps four significant digits; '.2s' would round both labels to the same value
fig.update_traces(texttemplate='%{text:.4s}', textposition='outside')
fig.update_layout(
    xaxis=dict(range=[1.76e7, 1.78e7]),  # Narrow range: bars do not start at zero, so their lengths exaggerate the difference
    showlegend=False
)
fig.show()
# Save the figure as an HTML file
output_file = "Average_Box_Office_Revenue_by_Dominating_Gender.html"
fig.write_html(output_file)
print(f"Figure saved as {output_file}")
Figure saved as Average_Box_Office_Revenue_by_Dominating_Gender.html
The plot shows that male-dominated movies have a slightly higher average box office revenue than female-dominated ones, but the difference is minimal (about 1.772e7 versus 1.770e7, roughly 0.1%).
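Note that the Step 4 grouping runs on `bollywood_data`, which has one row per actor, so a movie with a larger cast contributes more rows to the average. The ethnicity analysis instead deduplicated to one row per movie first. The sketch below contrasts the two approaches on toy data. It illustrates the weighting effect only and does not reproduce the notebook's figures.

```python
import pandas as pd

# Toy actor-level table: movie 1 has four cast rows, movies 2 and 3 have one each.
actors = pd.DataFrame({
    "wiki_movie_id": [1, 1, 1, 1, 2, 3],
    "box_office": [100.0, 100.0, 100.0, 100.0, 40.0, 20.0],
    "Gender Dominance": ["Male-Dominated"] * 5 + ["Female-Dominated"],
})

# Actor-level mean: movie 1 is counted four times -> (4 * 100 + 40) / 5 = 88.
actor_level = actors.groupby("Gender Dominance")["box_office"].mean()

# Movie-level mean: one row per movie -> (100 + 40) / 2 = 70.
per_movie = (
    actors.drop_duplicates(subset="wiki_movie_id")
    .groupby("Gender Dominance")["box_office"]
    .mean()
)
print(actor_level, per_movie, sep="\n\n")
```

Deduplicating on `wiki_movie_id` before averaging would make the gender comparison consistent with the per-movie ethnicity comparison.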
Let's analyse whether the number of female actors grew over the years, to see whether there was an effort to promote gender equality:
# Reload the Bollywood datasets for fresh un-altered data
bollywood_data, bollywood_data_ethnicity, bollywood_ethnic_realworld, \
male_bollywood_realworld_averages, female_bollywood_realworld_averages, \
bothsexes_bollywood_realworld_averages, male_bollywood_realworld_proportions, \
female_bollywood_realworld_proportions = var_loader(DATA_FOLDER, mode="bollywood")
# Step 1: Remove rows with NaN values in 'actor_gender' or 'release_y'
valid_data = bollywood_data.dropna(subset=['actor_gender', 'release_y'])
valid_data.head(1)
| | wiki_movie_id | movie_name | box_office | release_y | actor_gender | actor_name | age_at_release | actor_birth_year | countries_label | genres_label | actor_ethnicity_label | main_genre | actor_ethnicity_classification |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30674.0 | The Terrorist | 17768757.0 | 1999.0 | F | Bhavani | NaN | NaN | ['India'] | ['Thriller', 'World cinema', 'Drama', 'Politic... | Unknown | Drama | NaN |
# Step 2: Count the number of female actors per year
female_actor_counts = valid_data[valid_data['actor_gender'] == 'F'].groupby('release_y').size().reset_index(name='Female Actor Count')
# Create the plot
fig = px.line(
    female_actor_counts,
    x='release_y',
    y='Female Actor Count',
    title='Trend of Female Actors in Indian Movies Over the Years',
    # Label keys must match column names; the count column is 'Female Actor Count'
    labels={'release_y': 'Year', 'Female Actor Count': 'Count of Female Actors'},
    markers=True
)
# Update x-axis ticks to display at regular intervals (e.g., every 10 years)
fig.update_layout(
    xaxis=dict(
        tickmode='linear',
        tick0=1900,  # Start at a logical year, e.g., 1900
        dtick=10  # Set tick interval (e.g., every 10 years)
    )
)
# Show the plot
fig.show()
# Save the figure as an HTML file
output_file = "Number_Female_Actors_Over_the_Years.html"
fig.write_html(output_file)
print(f"Figure saved as {output_file}")
Figure saved as Number_Female_Actors_Over_the_Years.html
Observation:
Overall Growth:
- From the early 1900s to the early 2010s, there's a noticeable increase in the number of female actors. This aligns with the overall growth of the Indian film industry, increasing the number of movies produced each year. The rise in opportunities for female actors also reflects societal progress and changing gender norms.
Sudden Drop Post-2010:
- The sharp decline after 2010 is likely not representative of an actual drop in female actors in Indian movies. Instead, it could reflect the limitations of the dataset:
- The dataset stops at 2014, and recent years often have incomplete metadata since movies released close to the dataset's creation might not yet be fully documented. Movies from 2012-2014 may not have had sufficient time to gain recognition or enter the dataset.
The trends demonstrate the increasing presence of female actors over time, reflecting the growth and diversification of the Indian film industry.
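Since the raw count rises with the number of movies produced, the share of female actors per year may be a more direct indicator of changing representation. The following is a minimal sketch on toy data. The column names `release_y` and `actor_gender` come from the dataset, while `is_female` and `Female Share` are names introduced here for illustration.

```python
import pandas as pd

# Toy actor-level rows; the notebook would use valid_data instead.
toy = pd.DataFrame({
    "release_y": [1990.0, 1990.0, 1990.0, 1990.0, 2005.0, 2005.0],
    "actor_gender": ["F", "M", "M", "M", "F", "M"],
})
valid = toy.dropna(subset=["actor_gender", "release_y"])

# Share of rows per year whose gender is 'F' (the mean of a boolean column).
female_share = (
    valid.assign(is_female=valid["actor_gender"].eq("F"))
    .groupby("release_y")["is_female"]
    .mean()
    .rename("Female Share")
    .reset_index()
)
print(female_share)
```

A flat share alongside a rising count would suggest industry growth, not a shift in gender balance. Years with very few rows would still need to be read with caution.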
Age
This plot highlights a strong focus on youth in Indian cinema, with actors in their early 20s dominating the industry while older age groups remain significantly underrepresented. It emphasizes the disconnect between the on-screen portrayal of age and real-world demographics.
Cultural Preferences: The preference for youth in Bollywood is influenced by cultural norms that idealize youth and beauty, particularly for on-screen roles.
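The age distribution behind a plot like this can be tabulated from the `age_at_release` column. The sketch below is an assumption about how such a summary might be built, not the code that produced the original plot. The bin edges and labels are arbitrary choices, and the ages are toy values.

```python
import pandas as pd

# Toy ages; the notebook would use bollywood_data['age_at_release'].
toy = pd.DataFrame({"age_at_release": [22.0, 24.0, 31.0, 45.0, None, 67.0]})

# Arbitrary ten-year bins, left-closed: [20, 30) is labelled '20-29'.
bins = [0, 20, 30, 40, 50, 60, 120]
labels = ["<20", "20-29", "30-39", "40-49", "50-59", "60+"]

ages = toy["age_at_release"].dropna()
age_group = pd.cut(ages, bins=bins, labels=labels, right=False)

# Share of actors in each age group, in bin order.
age_share = age_group.value_counts(normalize=True).sort_index()
print(age_share)
```

Comparing these shares with the real-world age proportions loaded by `var_loader` would quantify the disconnect described above.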